THE PROBLEM
Google’s Chrome developer relations team considers the “big four” browsers to be (in no particular order) Chrome, Firefox, Edge, and Safari. However, according to statcounter.com, Opera had a usage share of 30%+ in some countries, whilst holding less than 5% worldwide.
The goal is to reach 95%+ of a given population, and if the target population happens to include these countries, then Opera must clearly be considered part of the main browser base.
Although the specific reasons for this outlier became clear with some investigation, the question remained: What else are we missing?
On a global scale, it’s fairly easy to collect logs from published sources such as CDNs (Content Delivery Networks) and determine global usage, but if your target audience isn’t global, where are developer efforts best served?
Note: all figures must be taken as given, though each can easily be disputed: a user-agent string is easily manipulated or faked; collecting from a single source will tend toward certain repeat types of traffic; many countries have internal or closed networks; and so on.
A SOLUTION
There is no single solution: the source data is not perfect, and even if it were, anomalies are often explainable. The approach presented here was to collect statistics where possible, and to apply a learning model to automatically detect and report outliers – data points that do not fit neatly into the model.
TOOLS
Using freely-available sources – world-population data from worldometers.info, and browser usage and history from statcounter.com – the data was pulled in and saved to local storage for examination (a loading sketch follows the tool lists below).
Using Python’s data-handling and learning libraries:
- Pandas – Python Data Analysis
- SciPy – Scientific Python
- scikit-learn (sklearn) – machine-learning algorithms
- PyOD – Python Outlier Detection
Visualisations were generated using:
- Pyplot from Matplotlib
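As a rough sketch of the loading step (the file names and column names here are hypothetical – the real data was scraped from the sources above and saved locally), reading the data with Pandas might look like this:

    import pandas as pd

    # Hypothetical local files, saved after scraping the sources above.
    population = pd.read_csv("population_by_country.csv")  # country, population
    usage = pd.read_csv("browser_usage_by_country.csv")    # country, browser, usage_pct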
METHODOLOGY
Using the global statistics as a basis, the data is split into segments: browser usage globally and per country. This data is then loaded into a kNN detection model. kNN (k-Nearest Neighbours) scores each data point by its distance to the data points closest to it.
For more information on kNN, see this article on medium.com.
Other machine-learning algorithms may well produce alternative results and could certainly be considered, but kNN is known to excel at data classification.
Each data point is the percentage use of a given browser, per locale segment.
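As a sketch of that shaping step (building on the hypothetical frames above), each country becomes one row, with one usage column per browser:

    # One row per country, one column per browser; browsers unused in a
    # country are filled with 0.0%.
    segments = usage.pivot_table(index="country",
                                 columns="browser",
                                 values="usage_pct",
                                 fill_value=0.0)

    X = segments.to_numpy()  # the feature matrix handed to the detector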
Using Euclidean distance (a VERY fancy way of saying “a straight line”), each data point is assigned to a class. That class then grows as further points are entered. If a data point cannot be fitted into one of the data groups, it is considered an outlier – an odd data point – and this is precisely what we are looking for.
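For illustration (the numbers here are made up), the Euclidean distance between two countries’ browser-share vectors really is just the straight-line distance:

    import numpy as np
    from scipy.spatial.distance import euclidean

    a = np.array([64.5, 18.2, 4.1, 3.3, 9.9])   # one country's browser shares
    b = np.array([61.0, 17.5, 5.0, 4.2, 12.3])  # another country's shares

    d = euclidean(a, b)
    # Identical by hand: the square root of the sum of squared differences.
    assert np.isclose(d, np.sqrt(np.sum((a - b) ** 2)))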
The model measures the distance to the 5 nearest neighbour groups. These can very roughly be thought of as the four main browsers, plus one other (a great many browsers are never used at all in a given country, so a usage of 0.0% is quite common).
PyOD uses a contamination parameter, a number indicating what proportion of the data you expect to be outliers; it is used to set the threshold on the outlier scores. Effectively, a contamination close to 0% will flag almost nothing, whilst a very high value will flag a large share of the data as outliers. Initial tests indicated that 2% provided a good balance.
A parameter specific to kNN is the method used to score against the nearest neighbours. A commonly-used value is “mean”, which uses the average of all k neighbour distances, but in this case we used “largest”, since we specifically aim to determine whether a browser’s usage is outside the norm for that specific browser, not just whether it’s abnormal overall.
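Putting those choices together, a minimal PyOD sketch (assuming the feature matrix X from the earlier sketches) might look like:

    from pyod.models.knn import KNN

    # contamination=0.02: expect roughly 2% of points to be outliers.
    # n_neighbors=5:      measure against the 5 nearest neighbours.
    # method="largest":   score on the distance to the furthest of the
    #                     k neighbours, rather than their mean.
    clf = KNN(contamination=0.02, n_neighbors=5, method="largest")
    clf.fit(X)

    labels = clf.labels_           # 0 = fits the model, 1 = outlier
    scores = clf.decision_scores_  # raw outlier scores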
Using these parameters, the data is fitted to the model, and each point is classified as 0 or 1 – it fits the model, or it does not. Per data set, an ROC (Receiver Operating Characteristic) and a precision score are generated. The ROC can be used to create an AUC (Area Under Curve) metric.
Information on ROC and AUC can be found on Google’s developer site.
Precision is the proportion of points flagged by a classification that truly belong to it – here, how many of the flagged outliers are genuine. A very low precision means most of the flagged points are false alarms, and the classification is unlikely to be trustworthy.
An AUC is a metric indicating the probability that the model ranks a randomly-chosen outlier above a randomly-chosen normal point – in effect, how confident we can be that the learning algorithm will place the next data point correctly.
A precision below 0.75 (less than 75%) or an AUC below 0.85 (less than 85%) generated a warning, and those data sets were not considered for the final output.
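As a sketch of that gating step (y_true is an assumption here – it would hold ground-truth 0/1 labels for a labelled validation subset, without which these metrics cannot be computed):

    from sklearn.metrics import precision_score, roc_auc_score

    auc = roc_auc_score(y_true, scores)          # ranking quality of the scores
    precision = precision_score(y_true, labels)  # share of flagged points that are real outliers

    if precision < 0.75 or auc < 0.85:
        print(f"WARNING: precision={precision:.2f}, AUC={auc:.2f} "
              "- excluded from final output")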
Further information on kNN precision and recall can be found here.
RESULTS
Once classified, all outliers/abnormalities were output and mapped to their countries, with additional data loaded, such as population.
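As a sketch of that mapping (reusing the hypothetical frames from the earlier sketches):

    # Attach the outlier flags and join against the population table.
    segments["outlier"] = labels
    report = (segments[segments["outlier"] == 1]
              .merge(population, left_index=True, right_on="country"))

    print(report[["country", "population"]])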
Lack-of-variance errors are extremely likely, because many browsers are localisations of another and are only popular in their host country. As a result, many browsers score 0% for many countries, and this potentially fatal data-modelling exception is expected and handled.
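One way such a guard can look – a sketch only, applied to the feature table before fitting – is to drop any browser column with zero variance across all countries:

    # Browsers with constant (typically all-zero) usage across every
    # country carry no signal for a distance-based model, so drop them.
    segments = segments.loc[:, segments.std() > 0]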